PROSPER Loan Exploration by Maureen Rocas

Univariate Plots Section

## [1] 113937     81
## 'data.frame':    113937 obs. of  81 variables:
##  $ ListingKey                         : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
##  $ ListingNumber                      : int  193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
##  $ ListingCreationDate                : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
##  $ CreditGrade                        : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
##  $ Term                               : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanStatus                         : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ ClosedDate                         : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
##  $ BorrowerAPR                        : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ BorrowerRate                       : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ LenderYield                        : num  0.138 0.082 0.24 0.0874 0.1985 ...
##  $ EstimatedEffectiveYield            : num  NA 0.0796 NA 0.0849 0.1832 ...
##  $ EstimatedLoss                      : num  NA 0.0249 NA 0.0249 0.0925 ...
##  $ EstimatedReturn                    : num  NA 0.0547 NA 0.06 0.0907 ...
##  $ ProsperRating..numeric.            : int  NA 6 NA 6 3 5 2 4 7 7 ...
##  $ ProsperRating..Alpha.              : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
##  $ ProsperScore                       : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ ListingCategory..numeric.          : int  0 2 0 16 2 1 1 2 7 7 ...
##  $ BorrowerState                      : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
##  $ Occupation                         : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
##  $ EmploymentStatus                   : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##  $ EmploymentStatusDuration           : int  2 44 NA 113 44 82 172 103 269 269 ...
##  $ IsBorrowerHomeowner                : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
##  $ CurrentlyInGroup                   : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
##  $ GroupKey                           : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
##  $ DateCreditPulled                   : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
##  $ CreditScoreRangeLower              : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper              : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ FirstRecordedCreditLine            : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
##  $ CurrentCreditLines                 : int  5 14 NA 5 19 21 10 6 17 17 ...
##  $ OpenCreditLines                    : int  4 14 NA 5 19 17 7 6 16 16 ...
##  $ TotalCreditLinespast7years         : int  12 29 3 29 49 49 20 10 32 32 ...
##  $ OpenRevolvingAccounts              : int  1 13 0 7 6 13 6 5 12 12 ...
##  $ OpenRevolvingMonthlyPayment        : num  24 389 0 115 220 1410 214 101 219 219 ...
##  $ InquiriesLast6Months               : int  3 3 0 0 1 0 0 3 1 1 ...
##  $ TotalInquiries                     : num  3 5 1 1 9 2 0 16 6 6 ...
##  $ CurrentDelinquencies               : int  2 0 1 4 0 0 0 0 0 0 ...
##  $ AmountDelinquent                   : num  472 0 NA 10056 0 ...
##  $ DelinquenciesLast7Years            : int  4 0 0 14 0 0 0 0 0 0 ...
##  $ PublicRecordsLast10Years           : int  0 1 0 0 0 0 0 1 0 0 ...
##  $ PublicRecordsLast12Months          : int  0 0 NA 0 0 0 0 0 0 0 ...
##  $ RevolvingCreditBalance             : num  0 3989 NA 1444 6193 ...
##  $ BankcardUtilization                : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
##  $ AvailableBankcardCredit            : num  1500 10266 NA 30754 695 ...
##  $ TotalTrades                        : num  11 29 NA 26 39 47 16 10 29 29 ...
##  $ TradesNeverDelinquent..percentage. : num  0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
##  $ TradesOpenedLast6Months            : num  0 2 NA 0 2 0 0 0 1 1 ...
##  $ DebtToIncomeRatio                  : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ IncomeRange                        : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
##  $ IncomeVerifiable                   : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
##  $ StatedMonthlyIncome                : num  3083 6125 2083 2875 9583 ...
##  $ LoanKey                            : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
##  $ TotalProsperLoans                  : int  NA NA NA NA 1 NA NA NA NA NA ...
##  $ TotalProsperPaymentsBilled         : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ OnTimeProsperPayments              : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ ProsperPaymentsLessThanOneMonthLate: int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPaymentsOneMonthPlusLate    : int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPrincipalBorrowed           : num  NA NA NA NA 11000 NA NA NA NA NA ...
##  $ ProsperPrincipalOutstanding        : num  NA NA NA NA 9948 ...
##  $ ScorexChangeAtTimeOfListing        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanCurrentDaysDelinquent          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LoanFirstDefaultedCycleNumber      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanMonthsSinceOrigination         : int  78 0 86 16 6 3 11 10 3 3 ...
##  $ LoanNumber                         : int  19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
##  $ LoanOriginalAmount                 : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ LoanOriginationDate                : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
##  $ LoanOriginationQuarter             : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
##  $ MemberKey                          : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
##  $ MonthlyLoanPayment                 : num  330 319 123 321 564 ...
##  $ LP_CustomerPayments                : num  11396 0 4187 5143 2820 ...
##  $ LP_CustomerPrincipalPayments       : num  9425 0 3001 4091 1563 ...
##  $ LP_InterestandFees                 : num  1971 0 1186 1052 1257 ...
##  $ LP_ServiceFees                     : num  -133.2 0 -24.2 -108 -60.3 ...
##  $ LP_CollectionFees                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_GrossPrincipalLoss              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NetPrincipalLoss                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NonPrincipalRecoverypayments    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PercentFunded                      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Recommendations                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsCount         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsAmount        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Investors                          : int  258 1 41 158 20 1 1 1 1 1 ...

ListingKey is a unique key for each listing. I wonder if all records have unique ListingKeys, let’s check…

## [1] 1698
## [1] "ProsperScore"
##                    ListingKey ListingNumber           ListingCreationDate
## 32681 00223594917038064A7C947        998257 2013-11-15 16:58:37.167000000
## 32682 00223594917038064A7C947        998257 2013-11-15 16:58:37.167000000
## 17275 00473590513960687DD308F        941296 2013-10-07 15:47:36.023000000
## 32965 00473590513960687DD308F        941296 2013-10-07 15:47:36.023000000
## 7479  0098360461900952056DB93       1190614 2014-03-02 14:21:39.583000000
## 33221 0098360461900952056DB93       1190614 2014-03-02 14:21:39.583000000
##       ProsperRating..Alpha. ProsperScore
## 32681                     C            5
## 32682                     C            6
## 17275                     C            6
## 32965                     C            4
## 7479                      E            2
## 33221                     E            4

There are 1698 duplicate records and looks like only ProsperScore is unique among the duplicate records.

## [1] 871

Removing the ProsperScore, I got 871 duplicate records. I’m thinking of focusing my analysis on ProsperRating and will not be looking into ProsperScore. So from the duplicate records, I’ll retain the first entry for each ListingKey and filter out the rest. Also, ProsperRating is only applicable for loans originated after July 2009, so I will filter out pre-July 2009 loans as well.

## [1] TRUE

##     1     2     3     4     5     6     7  NA's 
##  6916  9715 14168 18090 15368 14388  5321   127

Majority of the loans have a ProsperRating of 4, which is an average rating. The best rating, 7, has only over 5000.

I won’t include NA ProsperRating in my analysis, so I will remove them from our sample data.

##   600   620   640   660   680   700   720   740   760   780   800   820 
##  1038  1652  8746 13943 13839 13453 10905  7815  5213  3688  2098  1039 
##   840   860   880 
##   398   122    17

The majority of the borrowers have a CreditScore from 660 to 700.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3427    5000    5931    7083 1750000

There’s more than 1 million difference from 3rd quantile to the max. It’s also hard to see the distribution…

## 
##         394400  416666.666667  466666.666667  483333.333333  618547.833333 
##              1              1              1              1              1 
## 1750002.916667 
##              1
## 
##        0 0.083333     0.25 1.416667 1.666667 1.833333 
##      693        5        1        1        1        2

It’s hard to see the distribution, especially with an outlier in the millionth point. Using log10 certainly helps to see that majority of the borrowers have a monthly income of around $5000. I find the millionth monthly income strange… and there’s only 1. The next income below that outlier is $618,547. Another strange thing are the monthly income that are less than $1. Hmmm, curiouser and curiouser…

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0    3700    6356    8740   11600   72500   64342

The PrincipalBorrowed distribution is right-skewed, with median in $6356 and maximum in $72500. There’s a peak around $5,000.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    1140    4568   11400   13900  498400

The AvailableBankcardCredit’s mean, $11400, is more than twice the median value, $4564. However, looking at the figure above, it is hard to see the median and the mean values but I think it is somewhere within the peak. I’ll transform the x-axis to Log10.

Now, I can definitely see the peak; and the red dashline, which are the median and mean value, respectively, surround the two peaks.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

## 
##    12    36    60 
##  1613 58125 24228

There are 3 terms: 12 months, 36 months, and 60 months. Majority of the borrowers choose the 36 month term loan.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    7500    9062   13500   35000

In the LoanOriginalAmount distribution, you can clearly see the peaks for every $5,000 segment. So for $1,000-$5,000, the peak is $4,000. Likewise, for $26,000 to $30,000, the peak is $30,000. There seem to be a trend in the peaks… 10K, 20K, 25K, 30K, 35K. Do people just like a nice number, divisible by 5? But why 4K? Hmmm, is this worth exploring… or not.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0300  0.1259  0.1775  0.1863  0.2474  0.3400

There’s a big peak in 0.30. I wonder what loan amount and term gives this yield.

Now, I’ll look at the next 6 variables which I will refer to as Bad Records and grouped them into three: Inquiries, Delinquencies and Public Records. I’ll also show their summaries, where the first one will pertain to the first plot while the second summary will pertain to the second plot.

I’ll look at the Inquiries first.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.9644  1.0000 27.0000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   3.000   4.285   6.000  78.000

Both distribution are right-skewed. In the first distribution, majority have zero Inquiries for the last 6 months. While in the second distribution, majority of the loans have had atleast 2 Inquiries.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3236  0.0000 51.0000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   3.659   2.000  99.000

Similar to Inquiries, the distribution for Delinquencies are also right-skewed with majority of the loans having zero delinquencies. This is backed by their median values which is zero.

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##  0.000000  0.000000  0.000000  0.009254  0.000000 20.000000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.285   0.000  38.000

Again, the distribution for Public Records is right-skewed with majority of the loans having zero public records.

Among the bad record variables, Public Records has the least number, with maximum for the last 10 years is 38 compared to Total Inquiries with 78 and Delinquencies for the past 7 years as 99. This makes sense as Public Records pertains to court order or bankruptcies. With such rare occurence, I wonder what influence the Public Record have on ProsperRating. I’ll look at this in the Bivariate section.

All in all, the distribution for the bad record variables are right-skewed with majority of the data in or near zero point. This clearly shows that most of the borrowers do not have bad record. This is good to know for Prosper investors.

Univariate Analysis

What is the structure of your dataset?

Originally, there are 113,937 loan records with 81 features or variables. However, analyzing 81 features is difficult. So I’ve chosen 14 features and categorized them into 4 groups: Ratings, Borrower Information, Loan Information, and Bad Records.

These are:

Ratings:
  • ProsperRating - The rating assigned at the time the listing is created. Highest is 7 (AA in website), lowest is 1 (HR in website).
  • CreditScoreRangeLower - Shows lower credit rating provided by a consumer credit rating agency.
Borrower Information:
  • StatedMonthlyIncome
  • ProsperPrincipalBorrowed - Total principal borrowed on Prosper loans at the time the listing was created. This value will be null if the borrower had no prior loans.
  • AvailableBankcardCredit - The total available credit via bank card at the time the credit profile was pulled.
Loan Information:
  • LoanOriginalAmount
  • Term - The length of the loan expressed in months.
  • LenderYield - is equal to the interest rate on the loan less the servicing fee.
Bad Records:
  • InquiriesLast6Months and TotalInquiries - Number of inquiries at the time the credit profile was pulled.
  • CurrentDelinquencies and DelinquenciesLast7Years - Number of accounts delinquent at the time the credit profile was pulled
  • PublicRecordsLast12Months and PublicRecordsLast10Years - Number of public records at the time the credit profile was pulled. Public records include bankruptcies, liens, and judgements.

Furthermore, the 113,937 records was trimmed down to 83,966; loans that were made prior to July 2009, those with duplicate Listing Keys and those records with NA ProsperRating were removed.

Observations:
  • Prosper Rating and Credit Score have normal distributions. Prosper Rating peaks at 4 and Credit Score peaks at 660-700
  • The median monthly income is $5,000, 3rd quantile is $7,083. There’s 1 outlier at $1.75 million; removing this outlier, the maximum monthly income is $618,547
  • 75% of the borrowers have AvailableBankardCredit less than or equal to $13,900. While the maximum is $498,400. So 75% belong to just 2.8% of the maximum amount.
  • There are 3 terms: 12 months, 36 months and 60 months. Majority of the borrowers choose 36 month term.
  • The minimum loan is $1,000, median is $7,500 and the maximum is $35,000. There are noticeable peaks for every 5,000 segment: at $4,000, $10,0000, $15,000, $20,000, $25,000, $30,000, and $35,000.
  • The median and 3rd quantile value for public records and current delinquencies is 0. For 7 year delinquencies and 6month inquires, the median is zero, while 3rd quantile is 1 and 2, respectively. Only TotalInquiries have the peak at 2, while the rest have the peak at 0.

What is/are the main feature(s) of interest in your dataset?

I want to look at the data in behalf of a potential investor, thus, I chose ProsperRating as the main feature of interest. ProsperRating is the rating set on the loan listing and it tells a potential investor the potential risk of a loan. I would like to know what are considered risk and its relationship to ProsperRating.

Note that I did not include ProsperScore in my analysis although ProsperScore is one of the main features that is used to calculate ProsperRating. This is because ProsperScore is calculated based on prosper loan history. It tells the risk associated with a borrower. Since I would be including other prosper loan historical details that are maybe used in ProsperScore computation, I thought it would be redundant to include ProsperScore as well.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

The other main feature that will help support my investigation is CreditScore, which is given by a credit rating agency. Other variables that I included are:
  • AvailableBankcardCredit and borrower’s income, which tells the borrower’s capacity to pay
  • total principal borrowed, inquiries, delinquent payments, and public records, which gives some information on previous prosper loans the borrower made
  • loan amount and term, which are the features of the loan structure and is the basis for the rate applied to the loan.

Did you create any new variables from existing variables in the dataset?

I created four variables: LoanOrigDate, CurrentBadRecords, LoanAmtByTerm and CreditScoreRange.

I created LoanOrigDate in order to remove July 2009 records.

CurrentBadRecords is the sum of InquiriesLast6Months, CurrentDelinquencies , and PublicRecordsLast12Months.

The LoanAmtByTerm is Loan Amount per Month.

CreditScoreRange, which has 5 values, is a smaller version of CreditScoreRangeLower, which has 15 values.

Below is the mapping for CreditScoreRange:

CreditScoreRangeLower   CreditScoreRange
600, 620, 640           600-659
660, 680                660-699
700, 720                700-739
740, 760, 780           740-799
800, 820, 840, 860      800-899

The frequency of data for each of the CreditScoreRangeLower is considered in deciding the appropriate partitioning used in the mapping. The reason for creating this variable is for visualization purposes.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I log transformed the right-skewed StatedMonthlyIncome in order to better see the distribution. The transformed distribution became normal, peaking at $5,000 monthly income. I find unusual the outlier monthly income at $1,750,002 and the income at less than $1. I also log transformed the AvailableBankcardCredit to also better see the distribution and found two peaks that are near or in-between its median and mean value.

I checked whether ListingKey is unique and I found that those records with duplicate ListingKey have duplicate data for all the variables except for ProsperScore. I find this odd and I think there may have been some input error. Since, I will not be analyzing ProsperScore, I’ve decided to only include the first entries for each duplicate record and filter out the rest.

I also filtered out records for loans created before July 2009. This is because ProsperRating is only applicable after July 2009. Furthermore, I’ve also removed the records with NA ProsperRating, since I am focusing my investigation on ProsperRating.

I’ve created a data frame that contains the summaries (quantile values) per ProsperRating of each of the variables or features. This is done to create a summary plot.

Bivariate Plots Section

In the correlation matrix, note that I renamed the variables to a shorter version to make the plot larger.

##              rating cscore income principal bankcard loanamount  term
## rating         1.00   0.55   0.26      0.24     0.47       0.45  0.08
## cscore         0.55   1.00   0.15      0.23     0.55       0.29  0.07
## income         0.26   0.15   1.00      0.30     0.23       0.41  0.07
## principal      0.24   0.23   0.30      1.00     0.27       0.42  0.14
## bankcard       0.47   0.55   0.23      0.27     1.00       0.34  0.06
## loanamount     0.45   0.29   0.41      0.42     0.34       1.00  0.37
## term           0.08   0.07   0.07      0.14     0.06       0.37  1.00
## yield         -0.96  -0.51  -0.26     -0.22    -0.46      -0.42  0.01
## inquiry6mos   -0.26  -0.07   0.11      0.01     0.04      -0.10 -0.05
## inquirytotal  -0.17  -0.14   0.18      0.01     0.04      -0.02 -0.01
## delinq        -0.18  -0.20  -0.03     -0.12    -0.16      -0.15 -0.05
## delinq7yrs    -0.21  -0.30  -0.04     -0.13    -0.28      -0.15 -0.04
## pubrec12mos   -0.05  -0.04   0.00     -0.03    -0.04      -0.03 -0.02
## pubrec10yrs   -0.15  -0.28  -0.09     -0.09    -0.26      -0.11 -0.01
##              yield inquiry6mos inquirytotal delinq delinq7yrs pubrec12mos
## rating       -0.96       -0.26        -0.17  -0.18      -0.21       -0.05
## cscore       -0.51       -0.07        -0.14  -0.20      -0.30       -0.04
## income       -0.26        0.11         0.18  -0.03      -0.04        0.00
## principal    -0.22        0.01         0.01  -0.12      -0.13       -0.03
## bankcard     -0.46        0.04         0.04  -0.16      -0.28       -0.04
## loanamount   -0.42       -0.10        -0.02  -0.15      -0.15       -0.03
## term          0.01       -0.05        -0.01  -0.05      -0.04       -0.02
## yield         1.00        0.25         0.15   0.19       0.20        0.05
## inquiry6mos   0.25        1.00         0.53   0.04       0.08        0.01
## inquirytotal  0.15        0.53         1.00   0.03       0.10        0.01
## delinq        0.19        0.04         0.03   1.00       0.37        0.06
## delinq7yrs    0.20        0.08         0.10   0.37       1.00        0.07
## pubrec12mos   0.05        0.01         0.01   0.06       0.07        1.00
## pubrec10yrs   0.14        0.05         0.08   0.11       0.36        0.18
##              pubrec10yrs
## rating             -0.15
## cscore             -0.28
## income             -0.09
## principal          -0.09
## bankcard           -0.26
## loanamount         -0.11
## term               -0.01
## yield               0.14
## inquiry6mos         0.05
## inquirytotal        0.08
## delinq              0.11
## delinq7yrs          0.36
## pubrec12mos         0.18
## pubrec10yrs         1.00

The strongest correlation is ProsperRating vs LenderYield at -0.96, next is CreditScore at 0.55, followed by BankcardCredit and LoanAmount. Other variables that show atleast 20% positive correlation are StatedMonthlyIncome and PrincipalAmount. Inquiry6Months and Delinquencies7Yrs also show atleast 20% correlation, but in the negative. All the bad record variables are in the negative correlation.

I am wondering about LenderYield, it has an almost perfect correlation with ProsperRating. So, I looked into Prosper website and found out that LenderYield is the borrower’s interest rate less lender’s servicing fee. And the Borrower’s rate is I think based on ProsperRating. In short, LenderYield may be dependent on ProsperRating. I will not be including this in my analysis because it is not a predictor variable for ProsperRating.

## prosper$ProsperRating: 1
##  600  620  640  660  680  700  720  740  760  780  800  820  840  860  880 
##  219  290 1049 1563 1466 1217  554  304  162   57   23   10    1    1    0 
## -------------------------------------------------------- 
## prosper$ProsperRating: 2
##  600  620  640  660  680  700  720  740  760  780  800  820  840  860  880 
##  571  782 2442 2431 1747  851  421  234  143   63   19    7    2    2    0 
## -------------------------------------------------------- 
## prosper$ProsperRating: 3
##  600  620  640  660  680  700  720  740  760  780  800  820  840  860  880 
##  214  503 2669 3147 2452 2075 1500  854  455  189   74   32    3    1    0 
## -------------------------------------------------------- 
## prosper$ProsperRating: 4
##  600  620  640  660  680  700  720  740  760  780  800  820  840  860  880 
##   18   42 1807 4063 3959 3621 2338 1117  580  358  136   38    9    3    1 
## -------------------------------------------------------- 
## prosper$ProsperRating: 5
##  600  620  640  660  680  700  720  740  760  780  800  820  840  860  880 
##   16   35  650 1950 2638 3177 2798 2164 1210  443  194   71   17    5    0 
## -------------------------------------------------------- 
## prosper$ProsperRating: 6
##  600  620  640  660  680  700  720  740  760  780  800  820  840  860  880 
##    0    0  129  789 1570 2284 2709 2413 1896 1513  685  285   98   15    2 
## -------------------------------------------------------- 
## prosper$ProsperRating: 7
##  600  620  640  660  680  700  720  740  760  780  800  820  840  860  880 
##    0    0    0    0    7  228  585  729  767 1065  967  596  268   95   14

The ProsperRating vs CreditScore plot shows an increasing trend. The data points gets bigger as it moves laterally along the CreditScoreRange and diagonally along the ProsperRating. Majority of the points are in the middle of the CreditScoreRange. There are no datapoints in the lowest CreditScore range for the top 2 ProsperRatings, 6th and 7th; and no datapoints in the highest CreditScore for the bottom 3: 1st, 2nd and 3rd.

I want to look at the ProsperRating density plot by CreditScoreRange to better see the peaks in the CreditScore for each ProsperRating. This time, I will use 5 groups instead of the 15 CreditScoreRangeLower values.

In this plot, I can clearly see in how CreditScore peaks changes as ProsperRating improves. The lowest CreditScore peaks in the bottom 3 ProsperRating while the highest CreditScore peaks in the top two ProsperRating.

## prosper$ProsperRating: 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    2500    3871    4852    5774 1750000 
## -------------------------------------------------------- 
## prosper$ProsperRating: 2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    2917    4167    5023    6083  483300 
## -------------------------------------------------------- 
## prosper$ProsperRating: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3083    4333    5185    6250  618500 
## -------------------------------------------------------- 
## prosper$ProsperRating: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3500    5000    5825    7000  394400 
## -------------------------------------------------------- 
## prosper$ProsperRating: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3917    5417    6442    7750  108800 
## -------------------------------------------------------- 
## prosper$ProsperRating: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    4167    5833    6744    8167   91670 
## -------------------------------------------------------- 
## prosper$ProsperRating: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    4833    6667    7663    9167  108300

I’ve removed the top 5% of the StatedMonthlyIncome data in order to better see the relationship between Income and Rating. What I notice immediately are the horizontal lines that tells me income is quite constant throughout the rating range. Even with overplotting, I can still distinguish the vertical lines. So, same income can be seen in the bottom rating up to the top rating but there is overplotting in the middle of the income range compared to its lower and higher range. In the 1st rating, although I still see the horizontal lines in the higher income range, there are less of them and as the rating goes up, more points can be seen in that side than in the lower side of the income range.

## prosper$ProsperRating: 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0    2500    4500    6375    8000   44800    5635 
## -------------------------------------------------------- 
## prosper$ProsperRating: 2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1000    3000    4550    6502    8000   50900    7155 
## -------------------------------------------------------- 
## prosper$ProsperRating: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1000    3000    5000    7130    9000   54220   10827 
## -------------------------------------------------------- 
## prosper$ProsperRating: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1000    4000    6381    8643   11500   72500   14624 
## -------------------------------------------------------- 
## prosper$ProsperRating: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1000    4000    7500    9418   13000   53200   12112 
## -------------------------------------------------------- 
## prosper$ProsperRating: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1000    4137    8000   10480   15000   67000   10018 
## -------------------------------------------------------- 
## prosper$ProsperRating: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1000    5000   10000   12200   17100   55900    3971

There area a lot of outliers, but I can clearly see that the PrincipalBorrowed median increases as the rating increases.

## Warning: Removed 2420 rows containing non-finite values (stat_boxplot).

## prosper$ProsperRating: 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0     194    1233    4363    4492  150500 
## -------------------------------------------------------- 
## prosper$ProsperRating: 2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0     332    1500    4223    4546  395500 
## -------------------------------------------------------- 
## prosper$ProsperRating: 3
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##      0.0    597.8   2430.0   6049.0   6902.0 412800.0 
## -------------------------------------------------------- 
## prosper$ProsperRating: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    1154    3914    8317   10020  373300 
## -------------------------------------------------------- 
## prosper$ProsperRating: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    2042    6262   12070   15650  360000 
## -------------------------------------------------------- 
## prosper$ProsperRating: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    4344   11620   18870   25490  406100 
## -------------------------------------------------------- 
## prosper$ProsperRating: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0   15030   28140   36230   48560  498400

There are more outliers in the beginning of the AvailableBankcardCredit range than at the end. Same with PrincipalBorrowed, the AvailableBankcardCredit median also increases as rating increases. However, there is a big difference in the median of the top ProsperRating compared with the rest; 7th rating is 59% more than the 6th rating.

## prosper$ProsperRating: 1
## 
##   36 
## 6916 
## -------------------------------------------------------- 
## prosper$ProsperRating: 2
## 
##   12   36   60 
##  288 7556 1871 
## -------------------------------------------------------- 
## prosper$ProsperRating: 3
## 
##   12   36   60 
##  295 9756 4117 
## -------------------------------------------------------- 
## prosper$ProsperRating: 4
## 
##    12    36    60 
##   264 10041  7785 
## -------------------------------------------------------- 
## prosper$ProsperRating: 5
## 
##   12   36   60 
##  316 8888 6164 
## -------------------------------------------------------- 
## prosper$ProsperRating: 6
## 
##    12    36    60 
##   269 10526  3593 
## -------------------------------------------------------- 
## prosper$ProsperRating: 7
## 
##   12   36   60 
##  181 4442  698

Looking at the density plot above, the 12th and 36th Term is quite constant throughout the ProsperRating. Also, the lowest ProsperRating only have 36 term. For the 60 Term, there’s a peak in the 4th rating.

On another note, the correlation of Term with ProsperRating is weak at 0.08. On its own, it has a weak influence and I am planning on using it with LoanOriginalAmount to get the Loan value per Month.

## prosper$ProsperRating: 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    3000    4000    3463    4000   16800 
## -------------------------------------------------------- 
## prosper$ProsperRating: 2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    3522    4000    4576    5000   15900 
## -------------------------------------------------------- 
## prosper$ProsperRating: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6001    7075   10000   15000 
## -------------------------------------------------------- 
## prosper$ProsperRating: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    5000   10000   10380   15000   25000 
## -------------------------------------------------------- 
## prosper$ProsperRating: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    6000   10000   11600   15000   35000 
## -------------------------------------------------------- 
## prosper$ProsperRating: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    5800   10000   11450   15000   35000 
## -------------------------------------------------------- 
## prosper$ProsperRating: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    5900   10570   11570   16000   35000

I didn’t expect to see a clear trend between ProsperRating and LoanAmount, but the plot definitely shows that as loan amount increases, the rating increases. I am equating ProsperRating to risk, so I thought that smaller loan amount would have lower risk, and thus should have high rating. But that is not being shown here. Hmmm…

The median loan amount of the 1st, 2nd, and 3rd ratings are way smaller compared with the top 4 ratings. There’s is a 40% difference in the median loan amount between the 3rd and 4th rating. I can see a clear separation of the bottom 3 ratings from the rest.

Now, let’s look at the bad record variables. All these have negative correlations with ProsperRating. Similar to the Univariate section, I will group these variables into three: Inquiries, Delinquencies and Public Records.

## Warning: Removed 86 rows containing non-finite values (stat_boxplot).
## Warning: Removed 413 rows containing non-finite values (stat_boxplot).

## prosper$ProsperRating: 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.000   1.853   3.000  22.000 
## -------------------------------------------------------- 
## prosper$ProsperRating: 2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.000   1.448   2.000  27.000 
## -------------------------------------------------------- 
## prosper$ProsperRating: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.000   1.099   2.000  12.000 
## -------------------------------------------------------- 
## prosper$ProsperRating: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  1.0000  0.9104  1.0000 12.0000 
## -------------------------------------------------------- 
## prosper$ProsperRating: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.7213  1.0000 11.0000 
## -------------------------------------------------------- 
## prosper$ProsperRating: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.5897  1.0000 12.0000 
## -------------------------------------------------------- 
## prosper$ProsperRating: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.4676  1.0000  7.0000
## prosper$ProsperRating: 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   5.000   5.825   8.000  66.000 
## -------------------------------------------------------- 
## prosper$ProsperRating: 2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   4.000   5.355   7.000  71.000 
## -------------------------------------------------------- 
## prosper$ProsperRating: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   4.000   4.491   6.000  78.000 
## -------------------------------------------------------- 
## prosper$ProsperRating: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    2.00    3.00    4.28    6.00   44.00 
## -------------------------------------------------------- 
## prosper$ProsperRating: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   3.000   3.803   5.000  32.000 
## -------------------------------------------------------- 
## prosper$ProsperRating: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   3.000   3.561   5.000  28.000 
## -------------------------------------------------------- 
## prosper$ProsperRating: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   3.000   3.148   4.000  36.000

In both plots, I can see a decreasing trend. The topmost Rating has the lesser amount of datapoints compared to the lower Ratings. This is backed by the median value for both variables; the median decreases as rating increases.

## Warning: Removed 145 rows containing non-finite values (stat_boxplot).
## Warning: Removed 3339 rows containing non-finite values (stat_boxplot).

## prosper$ProsperRating: 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.5844  1.0000 22.0000 
## -------------------------------------------------------- 
## prosper$ProsperRating: 2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.5996  1.0000 32.0000 
## -------------------------------------------------------- 
## prosper$ProsperRating: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.4216  0.0000 21.0000 
## -------------------------------------------------------- 
## prosper$ProsperRating: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2896  0.0000 21.0000 
## -------------------------------------------------------- 
## prosper$ProsperRating: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2334  0.0000 51.0000 
## -------------------------------------------------------- 
## prosper$ProsperRating: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1546  0.0000 21.0000 
## -------------------------------------------------------- 
## prosper$ProsperRating: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.053   0.000  10.000
## prosper$ProsperRating: 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   4.947   5.000  99.000 
## -------------------------------------------------------- 
## prosper$ProsperRating: 2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   5.779   6.000  99.000 
## -------------------------------------------------------- 
## prosper$ProsperRating: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   4.815   4.000  99.000 
## -------------------------------------------------------- 
## prosper$ProsperRating: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   3.903   3.000  99.000 
## -------------------------------------------------------- 
## prosper$ProsperRating: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   3.084   1.000  99.000 
## -------------------------------------------------------- 
## prosper$ProsperRating: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   1.946   0.000  99.000 
## -------------------------------------------------------- 
## prosper$ProsperRating: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.508   0.000  71.000

The median for both variables is zero for each ProsperRating. For the CurrentDelinquencies, the 3rd quantiles for Rating 1 and 2 is higher by 1 point compared to the rest of the PropserRating, which is zero. For DelinquenciesLast7Years, the 2nd ProsperRating has a higher 3rd quantile value than the 1st ProsperRating.

## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
## Warning: Removed 13 rows containing non-finite values (stat_boxplot).

## prosper$ProsperRating: 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.01764 0.00000 3.00000 
## -------------------------------------------------------- 
## prosper$ProsperRating: 2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.01379 0.00000 4.00000 
## -------------------------------------------------------- 
## prosper$ProsperRating: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.01369 0.00000 4.00000 
## -------------------------------------------------------- 
## prosper$ProsperRating: 4
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##  0.000000  0.000000  0.000000  0.008734  0.000000 20.000000 
## -------------------------------------------------------- 
## prosper$ProsperRating: 5
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000000 0.000000 0.000000 0.006442 0.000000 4.000000 
## -------------------------------------------------------- 
## prosper$ProsperRating: 6
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000000 0.000000 0.000000 0.004379 0.000000 4.000000 
## -------------------------------------------------------- 
## prosper$ProsperRating: 7
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000000 0.000000 0.000000 0.001316 0.000000 2.000000
## prosper$ProsperRating: 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3242  1.0000 11.0000 
## -------------------------------------------------------- 
## prosper$ProsperRating: 2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.4072  1.0000 34.0000 
## -------------------------------------------------------- 
## prosper$ProsperRating: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3518  1.0000 16.0000 
## -------------------------------------------------------- 
## prosper$ProsperRating: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3165  1.0000 25.0000 
## -------------------------------------------------------- 
## prosper$ProsperRating: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2672  0.0000 16.0000 
## -------------------------------------------------------- 
## prosper$ProsperRating: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1802  0.0000 38.0000 
## -------------------------------------------------------- 
## prosper$ProsperRating: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.06108 0.00000 8.00000

I mentioned in the Univariate section that the occurence of PublicRecords is rare and this is definitely shown in the plots above where the quantiles are mostly in the zero point. Furthermore, the correlation with ProsperRating for PublicRecordsLast12Months and PublicRecordsLast10Years is -0.05 and -0.15, respectively. This definitely show a weak relationship which tells me that PublicRecord variable have less influence on the ProsperRating. I’ll check this more in the Linear Regression.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

ProsperRating correlates the strongest with CreditScore, followed by AvailableBankcardCredit and LoanAmount.

In the CreditScore vs ProsperRating, an increasing trend is seen; as CreditScore increases so does the ProsperRating.

In the StatedMonthlyIncome vs ProsperRating, majority of the datapoints are in the middle of the Income range. There are lots of vertical lines which suggest that income is fairly constant and the same income can be seen in all of the ratings. The difference in the median between each ProsperRating is not that high; the biggest difference is from ProsperRating 6th to 7th which is 834. Nevertheless, the median Income still increases as ProsperRating improves.

Similar with StatedMonthlyIncome, both the median value of ProsperPrincipalBorrowed and AvailableBankcardCredit increases as ProsperRating improves. Furthermore, the median of the AvailableBankcardCredit for the 7th Rating is more than twice the median of the 6th Rating.

The LoanOriginalAmount median also increases as ProsperRating increases. Lower loan amount have lower ProsperRating than loans with higher amount.

All inquiries, delinquencies and public record features have a negative relationship with ProsperRating; as Rating increases, frequency of bad records decreases.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The AvailableBankcardCredit correlates well with CreditScore at 0.55, which makes sense as they are both related to a borrower’s credit standing.

Both ProsperPrincipalBorrowed and StatedMonthlyIncome correlates more strongly with LoanOriginalAmount at 0.42 and 0.41, respectively than with ProsperRating. The LoanOriginalAmount eventually becomes ProsperPrincipalBorrowed, so its strong correlation is expected. For StatedMonthlyIncome, I guess the amount a borrower loans is a proportion of his income; the higher his income, the higher the amount that he can loan.

The CurrentDelinquencies, DelinquenciesLast7Years and PublicRecordsLast10Years have a stronger correlation with CreditScore compared with ProsperRating, at -0.20, -0.30, and -0.28, respectively. So it seems that these variables have more influence on the CreditScore than with ProsperRating; unlike with Inquiries where it has greater correlation or influence with ProsperRating than CreditScore.

What was the strongest relationship you found?

CreditScore shows the strongest relationship with ProsperRating at positive correlation of 0.55, followed by AvailableBankcardCredit at 0.47 and LoanOriginalAmount at 0.45.

Multivariate Plots Section

There are 4 main loan feature with positive correlation, not including Term. These are StatedMonthlyIncome, AvailableBankcardCredit, LoanOriginalAmount, and ProsperPrincipalBorrowed. First, I want to see the relationship of these variables with ProsperRating and CreditScore. Later, I will look at how these positive-correlated variables relate to the Bad Record variables.

I removed the top 5% and the relationship that pops out more is ProsperRating vs CreditScore where higher ratings have higher CreditScore. The StatedMonthlyIncome is quite constant where the same horizontal lines can be seen in all the ProsperRating. What I can conclude in this plot is that CreditScore influences the ProsperRating more than the StatedMonthlyIncome.

Unlike the StatedMonthlyIncome, I can see the influence of AvailableBankcardCredit with CreditScore and ProsperRating. Lower CreditScore can be seen in the lower range of BankcardCredit and higher CreditScore is seen in higher BankcardCredit. Furthermore, the BankcardCredit variance increases as the ProsperRating increases.

Here, I can see that lower ProsperRating have lower CreditScore and higher ProsperRating have higher CreditScore.

Now, I want to look at the Loan per Month by dividing LoanOriginalAmount by Term. There is a variable that actually depicts what the borrower pay per month; this is MonthlyLoanPayment. I actually thought of using this instead of Loan Amount but I don’t want to add any bias that may incur because of rates applied to the loan.

## prosper$ProsperRating: 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   27.78   83.33  111.10   96.19  111.10  466.70 
## -------------------------------------------------------- 
## prosper$ProsperRating: 2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   27.78   69.44  111.10  122.10  138.90  833.30 
## -------------------------------------------------------- 
## prosper$ProsperRating: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   27.78   97.22  152.20  171.20  236.10 1250.00 
## -------------------------------------------------------- 
## prosper$ProsperRating: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   27.78  125.00  216.70  232.30  277.80 2083.00 
## -------------------------------------------------------- 
## prosper$ProsperRating: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   27.78  142.40  250.00  262.80  333.30 2083.00 
## -------------------------------------------------------- 
## prosper$ProsperRating: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   27.78  146.00  250.00  283.00  416.30 2083.00 
## -------------------------------------------------------- 
## prosper$ProsperRating: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   27.78  166.70  277.80  308.60  416.70 2083.00

The plot above is similar to the LoanOriginalAmount vs ProsperRating by CreditScore. High ProsperRating have high CreditScore. Furthermore, the median amount for LoanAmtByTerm also increases as ProsperRating increases.

## Warning: Removed 64342 rows containing missing values (geom_point).

Again, I can see the relationship of ProsperRating and CreditScore; higher ProsperRating have higher CreditScore. This relationship is apparent in all the positive-correlated plots above.

Now, I want to see the relationship of the negative-correlated features, which I call Bad Record variables, with CreditScore. I am thinking of just using the bad record values incurred recently rather than the Overall Total Bad Record values. I think that the current values have more impact than the overall total. Among the bad record variables, InquiriesLast6Months have the strongest correlation at -0.26, next is -0.21 for Delinquency7Years followed by CurrentDelinquencies at -0.18. There is just a .03 difference in the Delinquency correlation values, so I think it would be enough just to use the recent/current values rather than using the overall total. Before, I disregard the past inquires, delinquencies and public records, let’s compare their R-squared value to check if its enough to only use the current bad record values.

## 
## Calls:
## m1: lm(formula = as.numeric(ProsperRating) ~ I(-InquiriesLast6Months - 
##     CurrentDelinquencies - PublicRecordsLast12Months), data = prosper)
## m2: lm(formula = as.numeric(ProsperRating) ~ I(-TotalInquiries - 
##     CurrentDelinquencies - DelinquenciesLast7Years - PublicRecordsLast12Months - 
##     PublicRecordsLast10Years), data = prosper)
## 
## ===================================================================================================================================================
##                                                                                                                                 m1          m2     
## ---------------------------------------------------------------------------------------------------------------------------------------------------
## (Intercept)                                                                                                                   4.417***    4.349*** 
##                                                                                                                              (0.007)     (0.007)   
## I(-InquiriesLast6Months - CurrentDelinquencies - PublicRecordsLast12Months)                                                   0.268***             
##                                                                                                                              (0.003)               
## I(-TotalInquiries - CurrentDelinquencies - DelinquenciesLast7Years - PublicRecordsLast12Months - PublicRecordsLast10Years)                0.033*** 
##                                                                                                                                          (0.001)   
## ---------------------------------------------------------------------------------------------------------------------------------------------------
## R-squared                                                                                                                         0.086       0.046
## adj. R-squared                                                                                                                    0.086       0.046
## sigma                                                                                                                             1.602       1.637
## F                                                                                                                              7897.006    4016.836
## p                                                                                                                                 0.000       0.000
## Log-likelihood                                                                                                              -158698.926 -160510.812
## Deviance                                                                                                                     215424.559  224925.309
## AIC                                                                                                                          317403.852  321027.624
## BIC                                                                                                                          317431.866  321055.638
## N                                                                                                                             83966       83966    
## ===================================================================================================================================================

Look’s like having the total bad records have lesser influence on the result of the ProsperRating. The current bad records have 0.086 R-squared compared to 0.046 with total bad records. So, I will just use the Current Bad Record values. Also, I would just like to compare the positive-correlated variables with the combination or the total Bad Record values. So, I will add them up to one variable named CurrentBadRecords.

## Warning: Removed 17787 rows containing missing values (geom_point).

## prosper$CreditScoreRange: 600-659
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.000   1.908   3.000  34.000 
## -------------------------------------------------------- 
## prosper$CreditScoreRange: 660-699
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.000   1.416   2.000  24.000 
## -------------------------------------------------------- 
## prosper$CreditScoreRange: 700-739
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.000   1.145   2.000  51.000 
## -------------------------------------------------------- 
## prosper$CreditScoreRange: 740-799
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.000   1.006   1.000  22.000 
## -------------------------------------------------------- 
## prosper$CreditScoreRange: 800-899
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.8345  1.0000 10.0000

I have trimmed down the CurrentBadRecords to 35 to better see the plot, as only CreditScoreRange 700-739 have a BadRecord count higher than 35. Majority of lowest ProsperRating can be found in the lowest CreditScoreRange and as CreditScore improves, ProsperRating improves and Bad Record count decreases. Furthermore, majority of loans with high Bad Record count have lower ProsperRating.

Now, let’s look at the relationship of the positive-correlated features with the CurrentBadRecord variable. Let’s start with Loan Amount by Term, faceted with CreditScoreRange.

Here, I can see smaller loan term value tend to have higher inquiries, delinquencies and public records. Furthermore, lower loan value also have low ProsperRating and more lower ProsperRating can be found in the lower CreditScore.

I remember that I was surprised to find out that lower Loan Amount have lower ProsperRating, and this plot certainly support this relationship. So, there is more risk in investing in lower loan amounts.

Again here, I can see that high AvailableBankcardCredit tend to have high ProsperRating, at the same time, they also have lower inquiries, delinquencies, and public records. Also, I can see that as CreditScore increases, the variance in the BankcardCredit also increases. Once more, low ProsperRating have higher risk; low ratings have more bad records associated with it and they tend to have lower AvailableBankcardCredit and CreditScore as well.

This plot is showing 95% of the StatedMonthlyIncome datapoints, removing the 1 million outlier. I can clearly see that the low ProsperRating shows more at the lower CreditScore and those with higher bad records also have lower PropserRating.

## Warning: Removed 6206 rows containing missing values (geom_point).
## Warning: Removed 22069 rows containing missing values (geom_point).
## Warning: Removed 19735 rows containing missing values (geom_point).
## Warning: Removed 13361 rows containing missing values (geom_point).
## Warning: Removed 2971 rows containing missing values (geom_point).

Again here, most lower ProsperRating appear at lower CreditScore and they have higher Bad Records. Also, most of the lower ProsperRating appear at the lower-end of the ProsperPrincipalBorrowed range. There is more risk in investing to loans which have lower previous loans.

I remember that LoanOriginalAmount have a higher correlation with ProsperPrincipalBorrowed than with ProsperRating. This makes sense as LoanAmount eventually becomes ProsperPrincipalBorrowed. Lower LoanAmount also have lower ProsperRating, same with lower PrincipalProsperBorrowed which also have lower ProsperRating.

What the previous plots have shown is that low income, low loan value, low bankcard credit and low principal borrowed have a higher risk. Higher risk means higher number of bad records and low ProsperRating. They all displayed the same trend and relationship with ProsperRating. So I think we can use linear model to predict ProsperRating.

But first, let’s look at the summaries of all the features for each ProsperRating.

The diagram above shows the values for each Quantile at 0.25, 0.5 or median, and 0.75. Comparing the quantile values for each ProsperRating, we can see that better ProsperRating have higher values than lower ProsperRating.

Now, let’s do our model for predicting ProsperRating. I am going to add in the individual features, except for LoanOriginalAmount where I will apply the LoanValue per Term.

## 
## Calls:
## m1: lm(formula = as.numeric(ProsperRating) ~ CreditScoreRangeLower, 
##     data = prosper)
## m2: lm(formula = as.numeric(ProsperRating) ~ CreditScoreRangeLower + 
##     I(LoanOriginalAmount/Term), data = prosper)
## m3: lm(formula = as.numeric(ProsperRating) ~ CreditScoreRangeLower + 
##     I(LoanOriginalAmount/Term) + AvailableBankcardCredit, data = prosper)
## m4: lm(formula = as.numeric(ProsperRating) ~ CreditScoreRangeLower + 
##     I(LoanOriginalAmount/Term) + AvailableBankcardCredit + ProsperPrincipalBorrowed, 
##     data = prosper)
## m5: lm(formula = as.numeric(ProsperRating) ~ CreditScoreRangeLower + 
##     I(LoanOriginalAmount/Term) + AvailableBankcardCredit + ProsperPrincipalBorrowed + 
##     StatedMonthlyIncome, data = prosper)
## m6: lm(formula = as.numeric(ProsperRating) ~ CreditScoreRangeLower + 
##     I(LoanOriginalAmount/Term) + AvailableBankcardCredit + ProsperPrincipalBorrowed + 
##     StatedMonthlyIncome + I(-InquiriesLast6Months), data = prosper)
## m7: lm(formula = as.numeric(ProsperRating) ~ CreditScoreRangeLower + 
##     I(LoanOriginalAmount/Term) + AvailableBankcardCredit + ProsperPrincipalBorrowed + 
##     StatedMonthlyIncome + I(-InquiriesLast6Months) + I(-CurrentDelinquencies), 
##     data = prosper)
## m8: lm(formula = as.numeric(ProsperRating) ~ CreditScoreRangeLower + 
##     I(LoanOriginalAmount/Term) + AvailableBankcardCredit + ProsperPrincipalBorrowed + 
##     StatedMonthlyIncome + I(-InquiriesLast6Months) + I(-CurrentDelinquencies) + 
##     I(-PublicRecordsLast12Months), data = prosper)
## 
## ==============================================================================================================================
##                                    m1          m2          m3          m4          m5          m6          m7          m8     
## ------------------------------------------------------------------------------------------------------------------------------
## (Intercept)                      4.711***    3.952***    3.746***    4.375***    4.330***    4.438***    4.455***    4.455*** 
##                                 (0.025)     (0.025)     (0.025)     (0.084)     (0.084)     (0.080)     (0.080)     (0.080)   
## CreditScoreRangeLower: .L        5.926***    5.155***    4.370***    4.947***    4.937***    4.445***    4.383***    4.380*** 
##                                 (0.153)     (0.144)     (0.144)     (0.505)     (0.504)     (0.484)     (0.483)     (0.483)   
## CreditScoreRangeLower: .Q       -0.729***   -0.497**    -0.806***   -1.511**    -1.516**    -1.602**    -1.579**    -1.578**  
##                                 (0.166)     (0.156)     (0.155)     (0.559)     (0.559)     (0.536)     (0.535)     (0.535)   
## CreditScoreRangeLower: .C       -0.299      -0.265      -0.336*     -0.105      -0.095      -0.152      -0.167      -0.167    
##                                 (0.158)     (0.148)     (0.147)     (0.537)     (0.536)     (0.514)     (0.514)     (0.514)   
## CreditScoreRangeLower: ^4       -0.067      -0.043      -0.115       0.126       0.151       0.113       0.118       0.118    
##                                 (0.139)     (0.131)     (0.130)     (0.467)     (0.467)     (0.448)     (0.447)     (0.447)   
## CreditScoreRangeLower: ^5       -0.063       0.004      -0.068      -0.142      -0.107      -0.132      -0.133      -0.133    
##                                 (0.118)     (0.111)     (0.110)     (0.376)     (0.376)     (0.361)     (0.360)     (0.360)   
## CreditScoreRangeLower: ^6        0.186       0.157       0.114       0.291       0.318       0.319       0.321       0.321    
##                                 (0.096)     (0.090)     (0.090)     (0.285)     (0.285)     (0.273)     (0.272)     (0.272)   
## CreditScoreRangeLower: ^7       -0.041      -0.038      -0.075      -0.026      -0.002      -0.001      -0.001      -0.001    
##                                 (0.077)     (0.073)     (0.072)     (0.207)     (0.207)     (0.198)     (0.198)     (0.198)   
## CreditScoreRangeLower: ^8        0.108       0.111       0.097       0.154       0.174       0.196       0.198       0.198    
##                                 (0.062)     (0.058)     (0.058)     (0.148)     (0.148)     (0.142)     (0.142)     (0.142)   
## CreditScoreRangeLower: ^9       -0.083      -0.059      -0.065       0.036       0.046       0.066       0.065       0.066    
##                                 (0.049)     (0.046)     (0.046)     (0.108)     (0.108)     (0.103)     (0.103)     (0.103)   
## CreditScoreRangeLower: ^10       0.042       0.045       0.044       0.055       0.057       0.064       0.064       0.064    
##                                 (0.039)     (0.037)     (0.037)     (0.080)     (0.080)     (0.077)     (0.077)     (0.077)   
## CreditScoreRangeLower: ^11       0.116***    0.110***    0.108***    0.214***    0.213***    0.219***    0.219***    0.219*** 
##                                 (0.031)     (0.029)     (0.029)     (0.061)     (0.061)     (0.059)     (0.059)     (0.059)   
## CreditScoreRangeLower: ^12       0.022       0.028       0.028       0.034       0.035       0.028       0.028       0.028    
##                                 (0.025)     (0.024)     (0.024)     (0.048)     (0.048)     (0.046)     (0.046)     (0.046)   
## CreditScoreRangeLower: ^13      -0.042*     -0.021      -0.020       0.071       0.070       0.065       0.065       0.066    
##                                 (0.021)     (0.019)     (0.019)     (0.039)     (0.039)     (0.038)     (0.038)     (0.038)   
## CreditScoreRangeLower: ^14      -0.025      -0.020      -0.019      -0.000       0.001      -0.003      -0.003      -0.003    
##                                 (0.017)     (0.016)     (0.016)     (0.033)     (0.033)     (0.031)     (0.031)     (0.031)   
## I(LoanOriginalAmount/Term)                   0.003***    0.003***    0.002***    0.002***    0.002***    0.002***    0.002*** 
##                                             (0.000)     (0.000)     (0.000)     (0.000)     (0.000)     (0.000)     (0.000)   
## AvailableBankcardCredit                                  0.000***    0.000***    0.000***    0.000***    0.000***    0.000*** 
##                                                         (0.000)     (0.000)     (0.000)     (0.000)     (0.000)     (0.000)   
## ProsperPrincipalBorrowed                                             0.000***    0.000***    0.000***    0.000***    0.000*** 
##                                                                     (0.000)     (0.000)     (0.000)     (0.000)     (0.000)   
## StatedMonthlyIncome                                                              0.000***    0.000***    0.000***    0.000*** 
##                                                                                 (0.000)     (0.000)     (0.000)     (0.000)   
## I(-InquiriesLast6Months)                                                                     0.202***    0.202***    0.202*** 
##                                                                                             (0.005)     (0.005)     (0.005)   
## I(-CurrentDelinquencies)                                                                                 0.048***    0.048*** 
##                                                                                                         (0.006)     (0.006)   
## I(-PublicRecordsLast12Months)                                                                                        0.067    
##                                                                                                                     (0.043)   
## ------------------------------------------------------------------------------------------------------------------------------
## R-squared                            0.303       0.383       0.392      0.530       0.531       0.569       0.570       0.570 
## adj. R-squared                       0.303       0.383       0.392      0.530       0.531       0.568       0.570       0.570 
## sigma                                1.398       1.316       1.306      1.172       1.171       1.123       1.121       1.121 
## F                                 2612.250    3481.109    3389.772   1301.593    1233.484    1360.465    1299.553    1237.869 
## p                                    0.000       0.000       0.000      0.000       0.000       0.000       0.000       0.000 
## Log-likelihood                 -147291.529 -142167.300 -141548.957 -30943.710  -30925.677  -30104.948  -30074.008  -30072.821 
## Deviance                        164168.955  145305.912  143181.469  26911.023   26861.611   24706.164   24628.380   24625.402 
## AIC                             294615.059  284368.601  283133.914  61925.419   61891.354   60251.896   60192.015   60191.642 
## BIC                             294764.469  284527.350  283302.001  62075.225   62049.044   60417.471   60365.474   60372.986 
## N                                83966       83966       83966      19624       19624       19624       19624       19624     
## ==============================================================================================================================

The overall R-squared is 0.57, which is good; not very strong but there is a 57% chance that our model will hold true for predicting the ProsperRating. Also, the difference in adding StatedMonthlyIncome is just 0.1%, also similar is CurrentDelinquencies. Also, looks like PublicRecordsLast12Months has no impact on ProsperRating at all.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Loans with higher ProsperRating have higher CreditScore and loans with lower ProsperRating have lower CreditScore. This relationship resonates in the features with positive correlation: LoanAmount per Term, StatedMonthlyIncome, AvailableBankcardCredit and ProsperPrincipalBorrowed.

Furthermore, higher ProsperRating have higher medians in all the features, except for the features with negative correlation which are Inquiries, Delinquencies and Public Records.

On the other hand, loans with higher number of Inquiries, Delinquencies and Public Records have lower ProsperRating than loans with lower number of Inquiries, Delinquencies and Public Records.

Majority of loans with low ProsperRating exhibits low value in Loan Amount, Bankcard Credit, Income and Principal Borrowed, at the same time they also have higher Inquiries, Delinquencies and Public Records, than loans with better ProsperRating.

Were there any interesting or surprising interactions between features?

I find interesting the relationship between Loan Amount and ProsperRating; lower Loan Amounts tend to have lower ProsperRating. I find this odd as I thought that there is low risk in investing to such loans. But low loan amounts also have low Credit Score and higher number of Inquiries, Delinquencies and Public Records.

Re-thinking about this relationship and the demographics of borrowers for low loan amounts, I realized that if you borrow a small amount, then that is probably the amount that you can pay. You may have a low Bankcard credit and low income, and you may not be able to go back to Prosper and borrow more. Furthermore, if you are low in cash, then the possibility of being delinquent, inspite of your loan being low, is high. Also, come to think about it, If I have two loans, one is big and the other is small. I would first take care of paying the higher loan than the lower loan because the rate of the higher loan is high.

In retrospect, now it makes sense the relationship of Loan Amount with ProsperRating. There are more risk in investing with lower loan amounts than with higher loan amounts.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Yes, I created a linear model with dependent variable as the ProsperRating and the predictors as the Credit Score, Loan Amount per Term, Income, Prosper Principal Borrowed, and the Current Inquiries, Delinquencies and Public Records.

The variables in the linear model only account for 57% of the variance in the Prosper Rating. This also means that there is a 57% chance that the model will hold true for predicting the Prosper Rating.

One of the model’s limitation is the huge amount of variables that are not analyzed for predicting ProsperRating. In fact, only 12 out of 80 (excluding Prosper Rating) were analyzed and only 9 were included in this model. Another limitation would be the type of regression used. Linear regression may not be the best model for predicting an ordinal variable, such as ProsperRating. There may be a better regression model to use, such as Ordinal Regression.


Final Plots and Summary

Plot One

Description One

This plot shows how CreditScore peak changes as ProsperRating improves. The lowest CreditScore peaks at ProsperRating 1-3 and then gets lower as ProsperRating improves, there is even no 600-659 CreditScore in the topmost ProsperRating. Similarly, the highest CreditScore has the lowest density in the ProsperRating 1, then goes up as ProsperRating improves where it peaks at ProsperRating 6 and 7.

Loans with low ProsperRating have low CreditScores; and as CreditScore improves so does the ProsperRating.

Plot Two

Description Two

This plot shows the quantile summaries per ProsperRating for each of the Positive-Correlated variables: Available Bankcard Credit, Monthly Income, Loan Amount, Loan Amount By Term, and Principal Borrowed.

Generally, we can see that better ProsperRating have higher values at 0.25, 0.50 or median and 0.75 quantiles than the lower ProsperRating. Furthermore, this increasing relationship is more pronounced in the Bankcard variable at all Quantile summary while in the other variables, the increasing value is more seen in the 0.75 quantile than in the 0.50 (median) and 0.25. This tells me that variance is greater between ProsperRating at 0.75 quantile mark.

Plot Three

Description Three

Majority of the loans in the lower CreditScore range have lower ProsperRating and as CreditScore improves, so does the ProsperRating. Furthermore, lower ProsperRating have higher numbers of Current Bad Records (Inquiries, Delinquencies and Public Records) and these lower ProsperRating appear at the lower-end of the Loan value by Term. As the loan value gets bigger, bad record count decreases and ProsperRating improves.

Loan Amount and Term are two of the main properties of a loan. This plot clearly shows that lower Loan Value by Term have more risk than higher loan value.


Reflection

The Prosper data set originally contains 113,937 records with 81 features and I was able to trim it down to 83,966 records; removing loans that were made prior to July 2009, those with duplicate ListingKeys and those with NA ProsperRating.

There are a lot of challenges I faced with this study. One is not being familiar with Prosper, so going through their website and reading information on Prosper listing certainly helps, especially in understanding the different features of the Prosper loan.

Another challenge is choosing which feature is the most interesting. The problem with Prosper is that it has many interesting features, so it is hard to choose which one is the most interesting, like ProsperScore, Borrower’s Rate, Lender Yield. However, what I mostly see in the Prosper website is the Prosper Rating. I also like the fact that Prosper Rating is a measurement of a Loan’s risk. So this made me interested on the different features that may make a loan risky.

The most difficult task I find in this study is choosing 10-15 features out of 81. There are some features that may refer to the same thing or the same concept. Like AvailableBankcardCredit, RevolvingCreditBalance and BankcardUtilization. The last two has something to do with how you use your credit card, but I chose the first one because it is the one that I am most familiar with. One premise I follow is to choose features that I am most familiar or those that I can easily understand. Another premise that I follow is that the features should come from independent observations or not computed within the Prosper environment. At first, I included ProsperScore but I realized that this is computed based on other features in the loan. Same thing with Lender Yield, which I included in the early days of this study. LenderYield has a 96% correlation with ProsperRating, so its almost perfect. It made me wonder if I should include it as it will drive my linear model. So, I read more on its definition from the website and concluded to remove it from my analysis as LenderYield’s value may be dependent on ProsperRating, not the other way around. So finally, I came up with 13 features, including ProsperRating, that I am happy with.

The ProsperRating risk relationship made sense to me when I got into the Multivariate section. Here, I can finally see how ProsperRating is a measure of risk in the Bad Record plots against the Loan Amount per Term, Principal Borrowed, Bankcard Credit and Income. These plots clearly shows Low Prosper Rating have higher risk. I especially like how the Bad Record vs Loan Value per Term validated the relationship shown in the plot for LoanOriginalAmount vs ProsperRating, wherein low loan amounts have lower Prosper Rating. The Bad Record vs Loan Value per Term supported the risk involved with lower loan amounts and somehow changed my perception on investing to lower loan amounts as oppose to investing to higher loan amounts in Prosper.

I would be interested in exploring Ordinal Regression and how it could be use to predict ProsperRating; finding out if it is a better model in predicting ProsperRating than Linear Regression. I would also be interested in looking at the other independent variables available in Prosper, which feature can improve the model and give greater measure of predicting Prosper Rating.

Acknowledgements

Below are the list of websites I referred to in order to complete this study: